Add health check to watch.stream for silent connection drops#2525
Add health check to watch.stream for silent connection drops#2525Urvashi0109 wants to merge 1 commit intokubernetes-client:masterfrom
watch.stream for silent connection drops#2525Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Urvashi0109 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Adds an optional “health check” mechanism to Watch.stream() intended to detect silent watch connection drops (e.g., during control plane upgrades) by configuring timeouts and retrying from the last observed resource_version.
Changes:
- Introduces
_health_check_intervalparameter inWatch.stream()and handlesReadTimeoutError/ProtocolErrorto trigger reconnects. - Auto-populates
_request_timeoutfrom_health_check_intervalwhen not explicitly provided. - Adds unit tests covering reconnect behavior, default behavior, timeout propagation, and request-timeout argument handling.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| kubernetes/base/watch/watch.py | Adds _health_check_interval handling, sets timeouts, and retries on read/connection errors. |
| kubernetes/base/watch/watch_test.py | Adds tests validating reconnect + timeout configuration/compatibility. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| # If health check is enabled, treat a read timeout as a | ||
| # silent connection drop and allow the outer while loop | ||
| # to reconnect using the last known resource_version. | ||
| if health_check_interval > 0: | ||
| pass # Fall through to retry logic below |
| except (ReadTimeoutError, ProtocolError) as e: | ||
| # If health check is enabled, treat a read timeout as a | ||
| # silent connection drop and allow the outer while loop | ||
| # to reconnect using the last known resource_version. | ||
| if health_check_interval > 0: | ||
| pass # Fall through to retry logic below | ||
| else: | ||
| raise |
| # Verify _request_timeout was set to the health check interval | ||
| fake_api.get_namespaces.assert_called_once_with( | ||
| _preload_content=False, watch=True, | ||
| timeout_seconds=10, _request_timeout=30) |
| # Verify the user's _request_timeout (60) was preserved, not overridden | ||
| fake_api.get_namespaces.assert_called_once_with( | ||
| _preload_content=False, watch=True, | ||
| timeout_seconds=10, _request_timeout=60) |
| if health_check_interval > 0 and '_request_timeout' not in kwargs: | ||
| kwargs['_request_timeout'] = health_check_interval |
What type of PR is this?
/kind bug
/kind feature
What this PR does / why we need it:
When running a watch on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane gets upgraded, the watch connection is silently dropped. The watcher hangs indefinitely - No exception is raised and no new events are received. This is because the TCP connection enters a half-open state where the client believes the connection is still alive, but the server side has been torn down during the upgrade.
This PR adds a
_health_check_intervalparameter towatch.stream()that detects silent connection drops and automatically reconnects:_health_check_intervalis set to a value > 0, a socket-level read timeout (_request_timeout) is configured on the HTTP connectionurllib3raises aReadTimeoutErrorresource_version, ensuring no events are missed_health_check_interval=0), preserving full backward compatibilityReadTimeoutErrorpropagates to the caller as beforeWhich issue(s) this PR fixes:
Fixes #2462
Special notes for your reviewer:
_request_timeout) to break out of the blocking read, then catching the resultingReadTimeoutError/ProtocolErrorexceptions_health_check_intervalfollows the existing convention in this codebase (e.g., _preload_content, _request_timeout) for parameters that are consumed by the client library rather than passed to the API serverDoes this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: